Buying and selling used smartphones used to happen on a handful of online marketplace sites. But the used and refurbished phone market has grown considerably over the past decade: a recent International Data Corporation (IDC) forecast predicts that the used phone market will be worth $52.7bn by 2023, growing at a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money when purchasing a smartphone. The used smartphone market offers plenty of other benefits as well. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors and platforms such as Verizon and Amazon provide attractive offers to customers on refurbished smartphones. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact by promoting recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the cheaper refurbished segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.
Build a linear regression model to predict the price of a used phone and identify factors that significantly influence it.
Data Description
The data contains the different attributes of used/refurbished phones. The detailed data dictionary is given below.
Data Dictionary
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Standard Library Imports
from pathlib import Path
# Installed packages
import pandas as pd
from pandas.api.types import is_string_dtype, is_numeric_dtype
pd.set_option("display.precision", 2)
# Removes the limit from the number of displayed columns and rows.
# This is so I can see the entire dataframe when I print it
pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', None)
import seaborn as sns
sns.set(color_codes=True) # adds a nice background to the graphs
import matplotlib.pyplot as plt
%matplotlib inline
from ipywidgets import widgets
import numpy as np
# Detailed Profiling Widget
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
import scipy
# this library contains a large number of probability distributions as well as a growing library of statistical functions
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression_model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression_model using statsmodels
import statsmodels.api as sm
Note: I am doing an initial EDA before any data transformations, to establish some basic relationships between the independent and dependent variables.
# Read in dataset
df = pd.read_csv("used_phone_data.csv")
df.head(5)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 23.97 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 111.62 | 86.96 |
| 1 | Honor | Android | 28.10 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 249.39 | 161.49 |
| 2 | Honor | Android | 24.29 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 359.47 | 268.55 |
| 3 | Honor | Android | 26.04 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 278.93 | 180.23 |
| 4 | Honor | Android | 15.72 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | 103.80 |
# keep an untouched copy of the raw data; all transformations below are applied to df
raw_df = df.copy()
df.shape
(3571, 15)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null object 1 os 3571 non-null object 2 screen_size 3571 non-null float64 3 4g 3571 non-null object 4 5g 3571 non-null object 5 main_camera_mp 3391 non-null float64 6 selfie_camera_mp 3569 non-null float64 7 int_memory 3561 non-null float64 8 ram 3561 non-null float64 9 battery 3565 non-null float64 10 weight 3564 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 dtypes: float64(9), int64(2), object(4) memory usage: 418.6+ KB
df.columns
Index(['brand_name', 'os', 'screen_size', '4g', '5g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price'],
dtype='object')
# Easy way to spot some outliers and probably some errors
df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]).apply(
lambda s: s.apply("{0:.2f}".format)
)
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3571.00 | 3391.00 | 3569.00 | 3561.00 | 3561.00 | 3565.00 | 3564.00 | 3571.00 | 3571.00 | 3571.00 | 3571.00 |
| mean | 14.80 | 9.40 | 6.55 | 54.53 | 4.06 | 3067.23 | 179.42 | 2015.96 | 675.39 | 237.39 | 109.88 |
| std | 5.15 | 4.82 | 6.88 | 84.70 | 1.39 | 1364.21 | 90.28 | 2.29 | 248.64 | 197.55 | 121.50 |
| min | 2.70 | 0.08 | 0.30 | 0.01 | 0.03 | 80.00 | 23.00 | 2013.00 | 91.00 | 9.13 | 2.51 |
| 1% | 3.49 | 0.30 | 0.30 | 4.00 | 0.25 | 337.84 | 48.00 | 2013.00 | 131.80 | 20.46 | 8.29 |
| 25% | 12.70 | 5.00 | 2.00 | 16.00 | 4.00 | 2100.00 | 140.00 | 2014.00 | 536.00 | 120.13 | 45.20 |
| 50% | 13.49 | 8.00 | 5.00 | 32.00 | 4.00 | 3000.00 | 159.00 | 2016.00 | 690.00 | 189.80 | 75.53 |
| 75% | 16.51 | 13.00 | 8.00 | 64.00 | 4.00 | 4000.00 | 184.00 | 2018.00 | 872.00 | 291.94 | 126.00 |
| 99% | 28.42 | 21.02 | 32.00 | 512.00 | 8.00 | 7812.00 | 588.70 | 2020.00 | 1085.30 | 949.95 | 597.18 |
| max | 46.36 | 48.00 | 32.00 | 1024.00 | 16.00 | 12000.00 | 950.00 | 2020.00 | 1094.00 | 2560.20 | 1916.54 |
sns.lmplot(x="screen_size", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b5ceb1f0>
sns.lmplot(x="ram", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b298cb80>
sns.lmplot(x="selfie_camera_mp", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b5f5d4f0>
sns.lmplot(x="int_memory", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b5ff3b80>
sns.lmplot(x="main_camera_mp", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b63c9af0>
sns.lmplot(x="battery", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b6459400>
sns.lmplot(x="weight", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b67b8610>
sns.lmplot(x="release_year", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b6850c40>
sns.lmplot(x="days_used", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b6bebaf0>
sns.lmplot(x="new_price", y="used_price", data=df)
<seaborn.axisgrid.FacetGrid at 0x7fa0b6c47550>
# average used price over the years
plt.figure(figsize=(15, 7))
sns.lineplot(x="release_year", y="used_price", data=df, ci=None)
plt.show()
profile = ProfileReport(
df,
title="Used Phone Data Pre Data Processing",
html={"style": {"full_width": True}},
sort=None,
)
profile.to_widgets()
# populate the list of numeric attributes and categorical attributes
num_list = []
cat_list = []
for column in df:
if is_numeric_dtype(df[column]):
num_list.append(column)
elif is_string_dtype(df[column]):
cat_list.append(column)
print(num_list)
print(cat_list)
['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price', 'used_price'] ['brand_name', 'os', '4g', '5g']
### 3. Univariate Analysis ###
# bar chart and histogram
for column in df:
plt.figure(column, figsize=(4.9, 4.9))
plt.title(column)
if is_numeric_dtype(df[column]):
df[column].plot(kind="hist")
elif is_string_dtype(df[column]):
# show only the top 10 value counts for each categorical column
df[column].value_counts()[:10].plot(kind="bar")
### 4. Multivariate Analysis ###
# correlation matrix and heatmap
correlation = df.corr()
sns.heatmap(correlation, cmap="GnBu", annot=True)
<AxesSubplot:>
# pairplot
sns.pairplot(df, height=2.5)
# grouped bar chart
for i in range(0, len(cat_list)):
primary_cat = cat_list[i]
for j in range(0, len(cat_list)):
secondary_cat = cat_list[j]
if secondary_cat != primary_cat:
plt.figure(figsize=(15, 15))
chart = sns.countplot(
data=df,
x=primary_cat,
hue=secondary_cat,
palette="GnBu",
order=df[primary_cat].value_counts().iloc[:10].index, # show only TOP10
)
# pairplot with hue
for i in range(0, len(cat_list)):
hue_cat = cat_list[i]
sns.pairplot(df, hue = hue_cat)
# box plot
for i in range(0, len(cat_list)):
cat = cat_list[i]
for j in range(0, len(num_list)):
num = num_list[j]
plt.figure(figsize=(7, 7))
sns.boxplot(x=cat, y=num, data=df, palette="GnBu")
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
#### Drop the os column: 1) iOS is perfectly correlated with the Apple brand, and 2) users choose a brand rather than an OS ####
df["os"].value_counts()
Android 3246 Others 202 Windows 67 iOS 56 Name: os, dtype: int64
df.drop(["os"], axis=1, inplace=True)
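The "perfectly correlated" claim above is easy to sanity-check with a crosstab before dropping the column. A minimal sketch on a hypothetical toy frame (not the real dataset):

```python
import pandas as pd

# hypothetical toy rows, not the real dataset: every iOS row carries the Apple brand
toy = pd.DataFrame(
    {
        "brand_name": ["Apple", "Apple", "Samsung", "Honor"],
        "os": ["iOS", "iOS", "Android", "Android"],
    }
)

# a crosstab makes the brand/os overlap explicit before dropping `os`
ct = pd.crosstab(toy["brand_name"], toy["os"])
print(ct)

# all iOS rows belong to Apple, so `os` adds little beyond `brand_name`
ios_brands = toy.loc[toy["os"] == "iOS", "brand_name"].unique()
print(ios_brands)
```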
# Do we have any missing data as a %
missing_count = df.isnull().sum() # the count of missing values
value_count = df.isnull().count() # the count of all values
missing_percentage = round(
missing_count / value_count * 100, 2
) # the percentage of missing values
missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
# create a dataframe
print(missing_df)
count percentage brand_name 0 0.00 screen_size 0 0.00 4g 0 0.00 5g 0 0.00 main_camera_mp 180 5.04 selfie_camera_mp 2 0.06 int_memory 10 0.28 ram 10 0.28 battery 6 0.17 weight 7 0.20 release_year 0 0.00 days_used 0 0.00 new_price 0 0.00 used_price 0 0.00
# Step 1: Fill the missing values for all numeric columns. We use the median, since so few values are missing from each column and the median is robust to outliers.
medianFiller = lambda x: x.fillna(x.median())
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
df[numeric_columns] = df[numeric_columns].apply(medianFiller, axis=0)
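On a toy frame (hypothetical values), the same apply/fillna pattern fills each numeric column's gaps with that column's own median:

```python
import numpy as np
import pandas as pd

# hypothetical mini-frame with a few gaps, mirroring the imputation step above
toy = pd.DataFrame(
    {"ram": [2.0, 4.0, np.nan, 8.0], "battery": [np.nan, 3000.0, 4000.0, 2000.0]}
)

num_cols = toy.select_dtypes(include=np.number).columns
# fill each numeric column's NaNs with that column's median
toy[num_cols] = toy[num_cols].apply(lambda s: s.fillna(s.median()))
print(toy)  # ram NaN -> 4.0 (median of 2, 4, 8); battery NaN -> 3000.0
```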
# Do we have any missing data as a %
missing_count = df.isnull().sum() # the count of missing values
value_count = df.isnull().count() # the count of all values
missing_percentage = round(
missing_count / value_count * 100, 2
) # the percentage of missing values
missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
# create a dataframe
print(missing_df)
count percentage brand_name 0 0.0 screen_size 0 0.0 4g 0 0.0 5g 0 0.0 main_camera_mp 0 0.0 selfie_camera_mp 0 0.0 int_memory 0 0.0 ram 0 0.0 battery 0 0.0 weight 0 0.0 release_year 0 0.0 days_used 0 0.0 new_price 0 0.0 used_price 0 0.0
# Let's look at the statistical summary of the data
pd.set_option(
"display.float_format", lambda x: "%.3f" % x
) # to display numbers rounded off to 3 decimal places
df.describe(include="all").T
### Note the pronounced right skew: compare the max with the 75th percentile for most numeric columns.
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 14.804 | 5.153 | 2.700 | 12.700 | 13.490 | 16.510 | 46.360 |
| 4g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3571.000 | NaN | NaN | NaN | 9.330 | 4.705 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3571.000 | NaN | NaN | NaN | 6.546 | 6.878 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3571.000 | NaN | NaN | NaN | 54.470 | 84.586 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3571.000 | NaN | NaN | NaN | 4.057 | 1.390 | 0.030 | 4.000 | 4.000 | 4.000 | 16.000 |
| battery | 3571.000 | NaN | NaN | NaN | 3067.113 | 1363.063 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3571.000 | NaN | NaN | NaN | 179.384 | 90.197 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
## 2: Feature Creation - Let's create totalmem = int_memory + ram, since both are measured on the same scale (GB)
df["totalmem"] = df["ram"] + df["int_memory"]
# let's look at the data to get a sense of the counts
df["totalmem"].value_counts()
20.000 1277 36.000 1005 68.000 444 132.000 152 136.000 106 134.000 104 35.000 60 34.000 49 70.000 49 18.000 41 512.250 33 256.250 32 260.000 28 264.000 22 67.000 22 17.000 14 32.250 11 9.000 11 128.250 10 268.000 10 1028.000 8 5.000 7 4.500 7 516.000 6 16.250 6 33.000 5 4.005 5 64.250 5 524.000 4 0.750 4 2.505 4 140.000 4 4.030 2 8.000 2 16.020 2 8.020 2 4.200 2 4.750 2 262.000 2 16.500 2 520.000 2 8.500 2 0.090 1 28.000 1 5.500 1 0.270 1 17.500 1 0.260 1 Name: totalmem, dtype: int64
# Drop the ram and int_memory columns, now combined into totalmem
df.drop(["ram", "int_memory"], axis=1, inplace=True)
# Let's look at the statistical summary of the data again
df.describe(include="all").T
### Note the pronounced right skew: compare the max with the 75th percentile for most numeric columns.
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 14.804 | 5.153 | 2.700 | 12.700 | 13.490 | 16.510 | 46.360 |
| 4g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3571.000 | NaN | NaN | NaN | 9.330 | 4.705 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3571.000 | NaN | NaN | NaN | 6.546 | 6.878 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| battery | 3571.000 | NaN | NaN | NaN | 3067.113 | 1363.063 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3571.000 | NaN | NaN | NaN | 179.384 | 90.197 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
| totalmem | 3571.000 | NaN | NaN | NaN | 58.526 | 84.708 | 0.090 | 20.000 | 36.000 | 68.000 | 1028.000 |
df.skew()
screen_size 0.858 main_camera_mp 0.787 selfie_camera_mp 1.824 battery 1.171 weight 3.147 release_year 0.320 days_used -0.342 new_price 3.739 used_price 4.749 totalmem 5.710 dtype: float64
df.kurtosis()
screen_size 1.212 main_camera_mp 5.012 selfie_camera_mp 3.478 battery 3.306 weight 12.869 release_year -1.219 days_used -0.683 new_price 26.472 used_price 41.151 totalmem 47.023 dtype: float64
# populate the list of numeric attributes and categorical attributes
num_list = []
cat_list = []
for column in df:
if is_numeric_dtype(df[column]):
num_list.append(column)
elif is_string_dtype(df[column]):
cat_list.append(column)
print(num_list)
print(cat_list)
['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'battery', 'weight', 'release_year', 'days_used', 'new_price', 'used_price', 'totalmem'] ['brand_name', '4g', '5g']
# Research note
# iPhone 13 Pro Max 6.7-inch screen
# Samsung Galaxy S21 Ultra 6.8-inch screen.
# Galaxy Z Fold 3- 7.6-inch screen
# How many have screens larger than 8 inches (20.32 cm)?
screen = df.loc[df["screen_size"] > 20.32].count()
screen
brand_name 509 screen_size 509 4g 509 5g 509 main_camera_mp 509 selfie_camera_mp 509 battery 509 weight 509 release_year 509 days_used 509 new_price 509 used_price 509 totalmem 509 dtype: int64
# Let's remove these from the used phone data; based on screen size they appear to be tablets.
df = df[df["screen_size"] <= 20.32]
df["totalmem"].plot(kind="hist")
<AxesSubplot:ylabel='Frequency'>
# bin totalmem into four equal-width tiers with custom labels
df["bintotalmem"] = pd.cut(
df["totalmem"], [-0.938, 257.068, 514.045, 771.023, 1028.0],
labels = ["bronze", "silver", "gold", "platnium"]
)
df["bintotalmem"].value_counts(dropna=False)
bronze 2980 silver 70 gold 7 platnium 5 Name: bintotalmem, dtype: int64
df.drop(["totalmem"], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3062 entries, 4 to 3570 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3062 non-null object 1 screen_size 3062 non-null float64 2 4g 3062 non-null object 3 5g 3062 non-null object 4 main_camera_mp 3062 non-null float64 5 selfie_camera_mp 3062 non-null float64 6 battery 3062 non-null float64 7 weight 3062 non-null float64 8 release_year 3062 non-null int64 9 days_used 3062 non-null int64 10 new_price 3062 non-null float64 11 used_price 3062 non-null float64 12 bintotalmem 3062 non-null category dtypes: category(1), float64(7), int64(2), object(3) memory usage: 314.2+ KB
df["bintotalmem"].value_counts()
bronze 2980 silver 70 gold 7 platnium 5 Name: bintotalmem, dtype: int64
sns.set(rc={'figure.figsize':(11.7,8.27)})
labeled_barplot(df, "bintotalmem", perc=True)
df["used_price_log"] = np.log(df["used_price"])
df["new_price_log"] = np.log(df["new_price"])
df.drop(["used_price"], axis=1, inplace=True)
df.drop(["new_price"], axis=1, inplace=True)
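The rationale for the log transform is that it pulls in the long right tail of the price columns. A small sketch on synthetic lognormal "prices" (illustrative only, not the real data) shows the skew dropping after np.log:

```python
import numpy as np
import pandas as pd

# synthetic right-skewed "prices" drawn from a lognormal (illustrative only)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=5, sigma=1, size=2000))

raw_skew = prices.skew()
# the log of a lognormal sample is normally distributed, so its skew is near zero
log_skew = np.log(prices).skew()
print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```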
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3062 entries, 4 to 3570 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3062 non-null object 1 screen_size 3062 non-null float64 2 4g 3062 non-null object 3 5g 3062 non-null object 4 main_camera_mp 3062 non-null float64 5 selfie_camera_mp 3062 non-null float64 6 battery 3062 non-null float64 7 weight 3062 non-null float64 8 release_year 3062 non-null int64 9 days_used 3062 non-null int64 10 bintotalmem 3062 non-null category 11 used_price_log 3062 non-null float64 12 new_price_log 3062 non-null float64 dtypes: category(1), float64(7), int64(2), object(3) memory usage: 378.7+ KB
# populate the list of numeric attributes and categorical attributes
num_list = []
cat_list = []
for column in df:
if is_numeric_dtype(df[column]):
num_list.append(column)
elif is_string_dtype(df[column]):
cat_list.append(column)
print(num_list)
print(cat_list)
['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'battery', 'weight', 'release_year', 'days_used', 'used_price_log', 'new_price_log'] ['brand_name', '4g', '5g']
Let's look at outliers in every numerical column
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))
for i, variable in enumerate(num_list):
plt.subplot(5, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
def treat_outliers(df, col):
"""
Cap outliers in a numerical variable using the IQR rule
df: dataframe
col: str, name of the numerical column to treat
"""
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
treat outlier in all numerical variables
col_list: list of numerical variables
df: data frame
"""
for c in col_list:
df = treat_outliers(df, c)
return df
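Before applying the treatment to the whole frame, the IQR capping logic can be checked on a toy series (hypothetical values) with one obvious outlier:

```python
import numpy as np
import pandas as pd

# toy series (hypothetical values) with one obvious outlier
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# values beyond the whiskers are capped at the whisker, not removed
clipped = np.clip(s, lower, upper)
print(upper, clipped.max())
```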
# treating the outliers in every column
num_list = df.select_dtypes(include=np.number).columns.tolist()
df = treat_outliers_all(df, num_list)
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))
for i, variable in enumerate(num_list):
plt.subplot(5, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3062 entries, 4 to 3570 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3062 non-null object 1 screen_size 3062 non-null float64 2 4g 3062 non-null object 3 5g 3062 non-null object 4 main_camera_mp 3062 non-null float64 5 selfie_camera_mp 3062 non-null float64 6 battery 3062 non-null float64 7 weight 3062 non-null float64 8 release_year 3062 non-null int64 9 days_used 3062 non-null float64 10 bintotalmem 3062 non-null category 11 used_price_log 3062 non-null float64 12 new_price_log 3062 non-null float64 dtypes: category(1), float64(8), int64(1), object(3) memory usage: 378.7+ KB
profile = ProfileReport(
df,
title="Used Phone Data EDA - Processed Data",
html={"style": {"full_width": True}},
sort=None,
)
profile.to_widgets()
We want to predict the used price.
Before we proceed to build a model, we'll have to encode categorical features.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
We will build a Linear Regression model using the train data and then check its performance.
# defining X and y - used_price_log is the target, so drop it from the predictors
X = df.drop(["used_price_log"], axis=1)
y = df["used_price_log"]
print(X.head())
print(y.head())
brand_name screen_size 4g 5g main_camera_mp selfie_camera_mp \
4 Honor 15.720 yes no 13.000 8.000
6 Honor 19.840 yes no 8.000 5.000
7 Honor 18.570 yes no 13.000 8.000
8 Honor 15.720 yes no 13.000 16.000
10 Honor 16.670 yes no 13.000 8.000
battery weight release_year days_used bintotalmem new_price_log
4 5000.000 185.000 2020 293.000 bronze 4.948
6 3020.000 144.000 2020 234.000 bronze 4.519
7 3400.000 164.000 2020 219.000 bronze 5.189
8 4000.000 165.000 2020 161.000 bronze 5.300
10 3020.000 150.000 2020 268.000 bronze 4.714
4 4.642
6 4.281
7 4.890
8 5.016
10 4.463
Name: used_price_log, dtype: float64
#### Encode categorical and object columns as 0/1 dummy variables using pd.get_dummies
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["category", "object"]).columns.tolist(),
drop_first=True,
)
X.head()
| screen_size | main_camera_mp | selfie_camera_mp | battery | weight | release_year | days_used | new_price_log | brand_name_Alcatel | brand_name_Apple | brand_name_Asus | brand_name_BlackBerry | brand_name_Celkon | brand_name_Coolpad | brand_name_Gionee | brand_name_Google | brand_name_HTC | brand_name_Honor | brand_name_Huawei | brand_name_Infinix | brand_name_Karbonn | brand_name_LG | brand_name_Lava | brand_name_Lenovo | brand_name_Meizu | brand_name_Micromax | brand_name_Microsoft | brand_name_Motorola | brand_name_Nokia | brand_name_OnePlus | brand_name_Oppo | brand_name_Others | brand_name_Panasonic | brand_name_Realme | brand_name_Samsung | brand_name_Sony | brand_name_Spice | brand_name_Vivo | brand_name_XOLO | brand_name_Xiaomi | brand_name_ZTE | 4g_yes | 5g_yes | bintotalmem_silver | bintotalmem_gold | bintotalmem_platnium | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 15.720 | 13.000 | 8.000 | 5000.000 | 185.000 | 2020 | 293.000 | 4.948 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 19.840 | 8.000 | 5.000 | 3020.000 | 144.000 | 2020 | 234.000 | 4.519 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 18.570 | 13.000 | 8.000 | 3400.000 | 164.000 | 2020 | 219.000 | 5.189 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 8 | 15.720 | 13.000 | 16.000 | 4000.000 | 165.000 | 2020 | 161.000 | 5.300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 10 | 16.670 | 13.000 | 8.000 | 3020.000 | 150.000 | 2020 | 268.000 | 4.714 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
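As a quick illustration of what drop_first=True does, a hypothetical one-column frame yields a single 4g_yes indicator, with the "no" level becoming the implicit baseline:

```python
import pandas as pd

# hypothetical one-column frame with a yes/no category
toy = pd.DataFrame({"4g": ["yes", "no", "yes"]})

# drop_first=True drops the first level ("no"), which becomes the baseline
dummies = pd.get_dummies(toy, columns=["4g"], drop_first=True)
print(dummies.columns.tolist())
```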
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.3,
random_state=42,  # fixes the random seed so the same split is produced on every run
)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2143 Number of rows in test data = 919
As a final preprocessing step, we converted the categorical and object columns into 0/1 dummy variables to ease interpretation.
With the data split into train and test sets, we can now fit the model.
# fitting the model on the train data (70% of the whole data)
linearregression = LinearRegression()
linearregression.fit(x_train, y_train)
LinearRegression()
Let's check the coefficients and intercept of the model.
coef_df = pd.DataFrame(
np.append(linearregression.coef_, linearregression.intercept_),
index=x_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| screen_size | 0.002 |
| main_camera_mp | 0.000 |
| selfie_camera_mp | 0.001 |
| battery | -0.000 |
| weight | 0.000 |
| release_year | 0.004 |
| days_used | -0.001 |
| new_price_log | 0.991 |
| brand_name_Alcatel | 0.002 |
| brand_name_Apple | 0.010 |
| brand_name_Asus | 0.022 |
| brand_name_BlackBerry | -0.002 |
| brand_name_Celkon | 0.043 |
| brand_name_Coolpad | 0.023 |
| brand_name_Gionee | -0.023 |
| brand_name_Google | 0.009 |
| brand_name_HTC | 0.016 |
| brand_name_Honor | 0.028 |
| brand_name_Huawei | 0.018 |
| brand_name_Infinix | 0.029 |
| brand_name_Karbonn | -0.016 |
| brand_name_LG | 0.014 |
| brand_name_Lava | -0.007 |
| brand_name_Lenovo | -0.004 |
| brand_name_Meizu | 0.012 |
| brand_name_Micromax | 0.001 |
| brand_name_Microsoft | -0.018 |
| brand_name_Motorola | 0.010 |
| brand_name_Nokia | -0.010 |
| brand_name_OnePlus | -0.029 |
| brand_name_Oppo | 0.006 |
| brand_name_Others | 0.006 |
| brand_name_Panasonic | -0.009 |
| brand_name_Realme | 0.058 |
| brand_name_Samsung | 0.012 |
| brand_name_Sony | 0.037 |
| brand_name_Spice | 0.014 |
| brand_name_Vivo | -0.005 |
| brand_name_XOLO | -0.001 |
| brand_name_Xiaomi | 0.016 |
| brand_name_ZTE | 0.007 |
| 4g_yes | -0.005 |
| 5g_yes | -0.048 |
| bintotalmem_silver | 0.002 |
| bintotalmem_gold | -0.168 |
| bintotalmem_platnium | -0.024 |
| Intercept | -8.826 |
Let's check the performance of the model using different metrics.
We will define a function to calculate MAPE and adjusted $R^2$.
We will create a function which will print out all the above metrics in one go.
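For reference, the two metrics the helper functions compute are, with $n$ observations, $k$ predictors, targets $y_i$, and predictions $\hat{y}_i$:

$$\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$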
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# checking model performance on the train set (the 70% of the data seen during training)
print("Training Performance\n")
linearregression_train_perf = model_performance_regression(
linearregression, x_train, y_train
)
linearregression_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.091 | 0.074 | 0.986 | 0.985 | 1.847 |
# checking model performance on the test set (the unseen 30% of the data)
print("Test Performance\n")
linearregression_test_perf = model_performance_regression(
linearregression, x_test, y_test
)
linearregression_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.097 | 0.078 | 0.984 | 0.983 | 1.922 |
Let's build a linear regression model using statsmodels.
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
x_train1 = sm.add_constant(x_train)
# adding constant to the test data
x_test1 = sm.add_constant(x_test)
olsmod0 = sm.OLS(y_train, x_train1).fit()
print(olsmod0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.986
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 3117.
Date: Thu, 21 Oct 2021 Prob (F-statistic): 0.00
Time: 18:29:13 Log-Likelihood: 2104.3
No. Observations: 2143 AIC: -4115.
Df Residuals: 2096 BIC: -3848.
Df Model: 46
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -8.8256 3.943 -2.238 0.025 -16.558 -1.093
screen_size 0.0019 0.001 1.641 0.101 -0.000 0.004
main_camera_mp 0.0002 0.001 0.246 0.806 -0.001 0.001
selfie_camera_mp 0.0015 0.001 1.990 0.047 2.18e-05 0.003
battery -1.994e-06 3.78e-06 -0.527 0.598 -9.41e-06 5.42e-06
weight 2.376e-05 0.000 0.219 0.826 -0.000 0.000
release_year 0.0043 0.002 2.203 0.028 0.000 0.008
days_used -0.0011 1.26e-05 -84.355 0.000 -0.001 -0.001
new_price_log 0.9907 0.005 195.775 0.000 0.981 1.001
brand_name_Alcatel 0.0017 0.020 0.087 0.931 -0.037 0.040
brand_name_Apple 0.0096 0.024 0.395 0.693 -0.038 0.057
brand_name_Asus 0.0224 0.020 1.145 0.252 -0.016 0.061
brand_name_BlackBerry -0.0020 0.029 -0.071 0.943 -0.058 0.054
brand_name_Celkon 0.0433 0.025 1.721 0.085 -0.006 0.093
brand_name_Coolpad 0.0229 0.032 0.726 0.468 -0.039 0.085
brand_name_Gionee -0.0226 0.022 -1.006 0.314 -0.067 0.021
brand_name_Google 0.0092 0.039 0.236 0.813 -0.067 0.086
brand_name_HTC 0.0163 0.020 0.824 0.410 -0.022 0.055
brand_name_Honor 0.0280 0.021 1.321 0.187 -0.014 0.069
brand_name_Huawei 0.0179 0.019 0.966 0.334 -0.018 0.054
brand_name_Infinix 0.0285 0.045 0.637 0.524 -0.059 0.116
brand_name_Karbonn -0.0157 0.027 -0.572 0.567 -0.069 0.038
brand_name_LG 0.0144 0.018 0.784 0.433 -0.022 0.050
brand_name_Lava -0.0074 0.025 -0.296 0.767 -0.057 0.042
brand_name_Lenovo -0.0039 0.019 -0.204 0.838 -0.041 0.033
brand_name_Meizu 0.0116 0.022 0.521 0.603 -0.032 0.055
brand_name_Micromax 0.0009 0.019 0.045 0.964 -0.037 0.039
brand_name_Microsoft -0.0181 0.030 -0.609 0.543 -0.076 0.040
brand_name_Motorola 0.0103 0.020 0.512 0.609 -0.029 0.050
brand_name_Nokia -0.0104 0.020 -0.518 0.605 -0.050 0.029
brand_name_OnePlus -0.0291 0.045 -0.652 0.515 -0.117 0.059
brand_name_Oppo 0.0060 0.020 0.302 0.763 -0.033 0.045
brand_name_Others 0.0061 0.017 0.350 0.727 -0.028 0.040
brand_name_Panasonic -0.0093 0.023 -0.397 0.691 -0.055 0.037
brand_name_Realme 0.0581 0.027 2.172 0.030 0.006 0.111
brand_name_Samsung 0.0118 0.018 0.663 0.508 -0.023 0.047
brand_name_Sony 0.0365 0.021 1.703 0.089 -0.006 0.079
brand_name_Spice 0.0145 0.026 0.554 0.580 -0.037 0.066
brand_name_Vivo -0.0055 0.022 -0.249 0.804 -0.049 0.038
brand_name_XOLO -0.0013 0.023 -0.059 0.953 -0.046 0.043
brand_name_Xiaomi 0.0157 0.022 0.727 0.468 -0.027 0.058
brand_name_ZTE 0.0068 0.020 0.344 0.731 -0.032 0.045
4g_yes -0.0046 0.006 -0.714 0.476 -0.017 0.008
5g_yes -0.0484 0.018 -2.623 0.009 -0.085 -0.012
bintotalmem_silver 0.0022 0.015 0.152 0.879 -0.027 0.031
bintotalmem_gold -0.1684 0.042 -4.000 0.000 -0.251 -0.086
bintotalmem_platnium -0.0238 0.042 -0.573 0.567 -0.105 0.058
==============================================================================
Omnibus: 274.646 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 676.987
Skew: -0.723 Prob(JB): 9.86e-148
Kurtosis: 5.344 Cond. No. 7.18e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.18e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. VIF is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
If VIF is 1, then there is no correlation between the $k$-th predictor and the remaining predictor variables, and hence the variance of $\beta_k$ is not inflated at all.
General rule of thumb:
If VIF is between 1 and 5, there is low multicollinearity.
If VIF is between 5 and 10, there is moderate multicollinearity.
If VIF exceeds 10, it shows signs of high multicollinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(x_train1)
| | feature | VIF |
|---|---|---|
| 0 | const | 3966439.052 |
| 1 | screen_size | 3.310 |
| 2 | main_camera_mp | 2.426 |
| 3 | selfie_camera_mp | 3.379 |
| 4 | battery | 4.083 |
| 5 | weight | 3.309 |
| 6 | release_year | 4.474 |
| 7 | days_used | 2.196 |
| 8 | new_price_log | 2.900 |
| 9 | brand_name_Alcatel | 3.475 |
| 10 | brand_name_Apple | 2.264 |
| 11 | brand_name_Asus | 3.467 |
| 12 | brand_name_BlackBerry | 1.547 |
| 13 | brand_name_Celkon | 1.864 |
| 14 | brand_name_Coolpad | 1.415 |
| 15 | brand_name_Gionee | 2.179 |
| 16 | brand_name_Google | 1.263 |
| 17 | brand_name_HTC | 3.534 |
| 18 | brand_name_Honor | 2.808 |
| 19 | brand_name_Huawei | 5.223 |
| 20 | brand_name_Infinix | 1.190 |
| 21 | brand_name_Karbonn | 1.590 |
| 22 | brand_name_LG | 5.439 |
| 23 | brand_name_Lava | 1.781 |
| 24 | brand_name_Lenovo | 4.197 |
| 25 | brand_name_Meizu | 2.382 |
| 26 | brand_name_Micromax | 3.688 |
| 27 | brand_name_Microsoft | 1.458 |
| 28 | brand_name_Motorola | 3.342 |
| 29 | brand_name_Nokia | 3.453 |
| 30 | brand_name_OnePlus | 1.187 |
| 31 | brand_name_Oppo | 3.708 |
| 32 | brand_name_Others | 9.636 |
| 33 | brand_name_Panasonic | 2.047 |
| 34 | brand_name_Realme | 1.772 |
| 35 | brand_name_Samsung | 7.912 |
| 36 | brand_name_Sony | 2.674 |
| 37 | brand_name_Spice | 1.686 |
| 38 | brand_name_Vivo | 2.528 |
| 39 | brand_name_XOLO | 2.107 |
| 40 | brand_name_Xiaomi | 2.553 |
| 41 | brand_name_ZTE | 3.659 |
| 42 | 4g_yes | 2.507 |
| 43 | 5g_yes | 1.277 |
| 44 | bintotalmem_silver | 1.072 |
| 45 | bintotalmem_gold | 1.053 |
| 46 | bintotalmem_platnium | 1.023 |
To remove multicollinearity
Drop the columns with a VIF score greater than 5, one at a time, building a model on the remaining columns each time.
Look at the adjusted R-squared and RMSE of each of these models.
Drop the variable whose removal causes the least change in adjusted R-squared.
Check the VIF scores again.
Continue till all VIF scores are under 5.
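This pruning idea can also be sketched as a generic loop (a minimal sketch on made-up toy columns, using a pure-VIF criterion rather than the adjusted R-squared comparison we apply to the phone data):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(predictors, threshold=5.0):
    """Iteratively drop the non-constant column with the highest VIF above threshold."""
    cols = list(predictors.columns)
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(predictors[cols].values, i) for i in range(len(cols))],
            index=cols,
        ).drop("const", errors="ignore")  # the intercept's VIF is not meaningful
        if vifs.max() <= threshold:
            return predictors[cols]
        cols.remove(vifs.idxmax())

# toy data: "b" nearly duplicates "a", so one of the pair gets pruned
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "const": 1.0,
    "a": a,
    "b": a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})
print(list(drop_high_vif(toy).columns))
```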
Let's define a function that will help us do this.
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)

    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    # empty lists to store adj. R-squared and RMSE values
    adj_r2 = []
    rmse = []

    # build ols models by dropping one of the high VIF columns at a time
    # store the adjusted R-squared and RMSE in the lists defined previously
    for cols in high_vif_columns:
        # defining the new train set
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]

        # create the model
        olsmodel = sm.OLS(target, train).fit()

        # adding adj. R-squared and RMSE to the lists
        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))

    # creating a dataframe for the results
    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)

    return temp
col_list = [
    "brand_name_Huawei",
    "brand_name_LG",
    "brand_name_Others",
    "brand_name_Samsung",
]
res = treating_multicollinearity(x_train1, y_train, col_list)
res
| | col | Adj. R-squared after_dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | brand_name_Others | 0.985 | 0.092 |
| 1 | brand_name_Samsung | 0.985 | 0.092 |
| 2 | brand_name_LG | 0.985 | 0.092 |
| 3 | brand_name_Huawei | 0.985 | 0.092 |
col_to_drop = "brand_name_Others"
x_train2 = x_train1.loc[:, ~x_train1.columns.str.startswith(col_to_drop)]
x_test2 = x_test1.loc[:, ~x_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(x_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping brand_name_Others
| | feature | VIF |
|---|---|---|
| 0 | const | 3963518.688 |
| 1 | screen_size | 3.299 |
| 2 | main_camera_mp | 2.426 |
| 3 | selfie_camera_mp | 3.378 |
| 4 | battery | 4.082 |
| 5 | weight | 3.306 |
| 6 | release_year | 4.469 |
| 7 | days_used | 2.196 |
| 8 | new_price_log | 2.899 |
| 9 | brand_name_Alcatel | 1.218 |
| 10 | brand_name_Apple | 1.291 |
| 11 | brand_name_Asus | 1.217 |
| 12 | brand_name_BlackBerry | 1.065 |
| 13 | brand_name_Celkon | 1.124 |
| 14 | brand_name_Coolpad | 1.054 |
| 15 | brand_name_Gionee | 1.101 |
| 16 | brand_name_Google | 1.053 |
| 17 | brand_name_HTC | 1.253 |
| 18 | brand_name_Honor | 1.235 |
| 19 | brand_name_Huawei | 1.413 |
| 20 | brand_name_Infinix | 1.038 |
| 21 | brand_name_Karbonn | 1.068 |
| 22 | brand_name_LG | 1.413 |
| 23 | brand_name_Lava | 1.079 |
| 24 | brand_name_Lenovo | 1.268 |
| 25 | brand_name_Meizu | 1.169 |
| 26 | brand_name_Micromax | 1.241 |
| 27 | brand_name_Microsoft | 1.042 |
| 28 | brand_name_Motorola | 1.237 |
| 29 | brand_name_Nokia | 1.281 |
| 30 | brand_name_OnePlus | 1.036 |
| 31 | brand_name_Oppo | 1.366 |
| 32 | brand_name_Panasonic | 1.100 |
| 33 | brand_name_Realme | 1.139 |
| 34 | brand_name_Samsung | 1.632 |
| 35 | brand_name_Sony | 1.202 |
| 36 | brand_name_Spice | 1.073 |
| 37 | brand_name_Vivo | 1.217 |
| 38 | brand_name_XOLO | 1.101 |
| 39 | brand_name_Xiaomi | 1.172 |
| 40 | brand_name_ZTE | 1.253 |
| 41 | 4g_yes | 2.506 |
| 42 | 5g_yes | 1.277 |
| 43 | bintotalmem_silver | 1.072 |
| 44 | bintotalmem_gold | 1.053 |
| 45 | bintotalmem_platnium | 1.023 |
olsmod1 = sm.OLS(y_train, x_train2).fit()
print(olsmod1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.986
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 3188.
Date: Thu, 21 Oct 2021 Prob (F-statistic): 0.00
Time: 18:29:14 Log-Likelihood: 2104.3
No. Observations: 2143 AIC: -4117.
Df Residuals: 2097 BIC: -3856.
Df Model: 45
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -8.8630 3.941 -2.249 0.025 -16.591 -1.135
screen_size 0.0018 0.001 1.624 0.105 -0.000 0.004
main_camera_mp 0.0002 0.001 0.251 0.802 -0.001 0.001
selfie_camera_mp 0.0015 0.001 1.996 0.046 2.61e-05 0.003
battery -2.016e-06 3.78e-06 -0.533 0.594 -9.43e-06 5.4e-06
weight 2.489e-05 0.000 0.230 0.818 -0.000 0.000
release_year 0.0043 0.002 2.215 0.027 0.000 0.008
days_used -0.0011 1.26e-05 -84.374 0.000 -0.001 -0.001
new_price_log 0.9907 0.005 195.840 0.000 0.981 1.001
brand_name_Alcatel -0.0038 0.012 -0.330 0.742 -0.027 0.019
brand_name_Apple 0.0040 0.018 0.220 0.826 -0.032 0.040
brand_name_Asus 0.0169 0.012 1.458 0.145 -0.006 0.040
brand_name_BlackBerry -0.0076 0.024 -0.321 0.748 -0.054 0.039
brand_name_Celkon 0.0378 0.020 1.932 0.053 -0.001 0.076
brand_name_Coolpad 0.0174 0.027 0.637 0.524 -0.036 0.071
brand_name_Gionee -0.0281 0.016 -1.762 0.078 -0.059 0.003
brand_name_Google 0.0037 0.036 0.103 0.918 -0.066 0.073
brand_name_HTC 0.0107 0.012 0.912 0.362 -0.012 0.034
brand_name_Honor 0.0224 0.014 1.597 0.110 -0.005 0.050
brand_name_Huawei 0.0124 0.010 1.284 0.199 -0.007 0.031
brand_name_Infinix 0.0229 0.042 0.548 0.584 -0.059 0.105
brand_name_Karbonn -0.0211 0.022 -0.943 0.346 -0.065 0.023
brand_name_LG 0.0089 0.009 0.948 0.343 -0.009 0.027
brand_name_Lava -0.0130 0.020 -0.663 0.507 -0.051 0.025
brand_name_Lenovo -0.0094 0.010 -0.903 0.366 -0.030 0.011
brand_name_Meizu 0.0060 0.016 0.387 0.699 -0.025 0.037
brand_name_Micromax -0.0047 0.011 -0.414 0.679 -0.027 0.017
brand_name_Microsoft -0.0236 0.025 -0.941 0.347 -0.073 0.026
brand_name_Motorola 0.0047 0.012 0.385 0.700 -0.019 0.029
brand_name_Nokia -0.0160 0.012 -1.306 0.192 -0.040 0.008
brand_name_OnePlus -0.0347 0.042 -0.831 0.406 -0.117 0.047
brand_name_Oppo 0.0005 0.012 0.040 0.968 -0.023 0.024
brand_name_Panasonic -0.0148 0.017 -0.866 0.387 -0.048 0.019
brand_name_Realme 0.0525 0.021 2.449 0.014 0.010 0.095
brand_name_Samsung 0.0062 0.008 0.773 0.440 -0.010 0.022
brand_name_Sony 0.0310 0.014 2.154 0.031 0.003 0.059
brand_name_Spice 0.0089 0.021 0.430 0.667 -0.032 0.050
brand_name_Vivo -0.0110 0.015 -0.722 0.471 -0.041 0.019
brand_name_XOLO -0.0068 0.016 -0.415 0.678 -0.039 0.025
brand_name_Xiaomi 0.0101 0.015 0.693 0.489 -0.019 0.039
brand_name_ZTE 0.0012 0.011 0.104 0.917 -0.021 0.024
4g_yes -0.0046 0.006 -0.705 0.481 -0.017 0.008
5g_yes -0.0484 0.018 -2.625 0.009 -0.085 -0.012
bintotalmem_silver 0.0023 0.015 0.156 0.876 -0.027 0.031
bintotalmem_gold -0.1684 0.042 -4.000 0.000 -0.251 -0.086
bintotalmem_platnium -0.0236 0.042 -0.570 0.569 -0.105 0.058
==============================================================================
Omnibus: 273.910 Durbin-Watson: 2.034
Prob(Omnibus): 0.000 Jarque-Bera (JB): 674.218
Skew: -0.721 Prob(JB): 3.94e-147
Kurtosis: 5.339 Cond. No. 7.18e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.18e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# initial list of columns
cols = x_train2.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
    # defining the train set
    x_train_aux = x_train2[cols]

    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break
selected_features = cols
print(selected_features)
['const', 'screen_size', 'selfie_camera_mp', 'release_year', 'days_used', 'new_price_log', 'brand_name_Celkon', 'brand_name_Gionee', 'brand_name_Realme', 'brand_name_Sony', '5g_yes', 'bintotalmem_gold']
x_train2 = x_train2[selected_features]
x_test2 = x_test2[selected_features]
olsmod2 = sm.OLS(y_train, x_train2).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.985
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 1.312e+04
Date: Thu, 21 Oct 2021 Prob (F-statistic): 0.00
Time: 18:29:14 Log-Likelihood: 2093.7
No. Observations: 2143 AIC: -4163.
Df Residuals: 2131 BIC: -4095.
Df Model: 11
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const -8.3329 3.450 -2.415 0.016 -15.099 -1.567
screen_size 0.0017 0.001 2.270 0.023 0.000 0.003
selfie_camera_mp 0.0014 0.001 2.083 0.037 8.3e-05 0.003
release_year 0.0041 0.002 2.372 0.018 0.001 0.007
days_used -0.0011 1.2e-05 -88.620 0.000 -0.001 -0.001
new_price_log 0.9929 0.004 265.860 0.000 0.986 1.000
brand_name_Celkon 0.0410 0.019 2.173 0.030 0.004 0.078
brand_name_Gionee -0.0308 0.015 -2.026 0.043 -0.061 -0.001
brand_name_Realme 0.0476 0.021 2.321 0.020 0.007 0.088
brand_name_Sony 0.0282 0.013 2.147 0.032 0.002 0.054
5g_yes -0.0488 0.017 -2.811 0.005 -0.083 -0.015
bintotalmem_gold -0.1655 0.042 -3.971 0.000 -0.247 -0.084
==============================================================================
Omnibus: 287.881 Durbin-Watson: 2.044
Prob(Omnibus): 0.000 Jarque-Bera (JB): 717.980
Skew: -0.751 Prob(JB): 1.24e-156
Kurtosis: 5.406 Cond. No. 3.74e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.74e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Now no feature has a p-value greater than 0.05, so we'll consider the features in x_train2 as the final ones and olsmod2 as the final model.
Now we'll check the rest of the assumptions on olsmod2.
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Why the test?
Linearity describes a straight-line relationship between the predictor and response variables, while independence means that the residuals are not correlated with one another. If either assumption is violated, the coefficient estimates and the inferences drawn from the model can be misleading.
How to check linearity and independence?
Make a plot of fitted values vs residuals. If they form a pattern (a curve, for example), the linearity assumption is not satisfied; if the residuals are randomly scattered around zero with no pattern, linearity and independence hold. The Durbin-Watson statistic in the model summary can also be checked for independence (a value close to 2 indicates no autocorrelation in the residuals).
How to fix if this assumption is not followed?
We can try adding other important features or applying transformations to the variables.
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 386 | 5.268 | 5.165 | 0.103 |
| 2936 | 4.935 | 5.013 | -0.078 |
| 2105 | 3.993 | 3.898 | 0.095 |
| 3471 | 5.029 | 4.964 | 0.065 |
| 737 | 5.304 | 5.174 | 0.130 |
# let's plot the fitted values vs residuals
sns.residplot(
    data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
TEST FOR NORMALITY
Why the test?
Error terms, or residuals, should be normally distributed. If they are not, confidence intervals for the coefficient estimates may become too wide or too narrow, which makes it difficult to estimate coefficients based on the minimization of least squares. Non-normality suggests that there are a few unusual data points that must be studied closely to make a better model.
How to check normality?
The shape of the histogram of residuals can give an initial idea about the normality.
It can also be checked via a Q-Q plot of residuals. If the residuals follow a normal distribution, they will make a straight line plot, otherwise not.
Other tests to check for normality include the Shapiro-Wilk test.
Null hypothesis: Residuals are normally distributed
Alternate hypothesis: Residuals are not normally distributed
How to fix if this assumption is not followed?
We can apply transformations like log, exponential, arcsinh, etc. as per our data.
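As a small synthetic illustration of such a fix (not the phone data): a right-skewed variable fails the Shapiro-Wilk test, while its log transform comes much closer to normality:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=300)  # heavily right-skewed

p_raw = stats.shapiro(skewed).pvalue          # tiny p-value: normality rejected
p_log = stats.shapiro(np.log(skewed)).pvalue  # much larger: the log is normal here
print(p_raw, p_log)
```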
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9583384990692139, pvalue=2.593337430209968e-24)
Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity: If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic.
Why the test?
The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in presence of outliers.
How to check for homoscedasticity?
The residual vs fitted values plot can be looked at to check for homoscedasticity. In the case of heteroscedasticity, the residuals form a funnel shape or some other non-symmetrical shape.
The Goldfeld-Quandt test can also be used. If we get a p-value > 0.05, we can say that the residuals are homoscedastic; otherwise, they are heteroscedastic.
Null hypothesis: Residuals are homoscedastic
Alternate hypothesis: Residuals have heteroscedasticity
How to fix if this assumption is not followed?
Heteroscedasticity can be fixed by adding other important features or making transformations.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train2)
lzip(name, test)
[('F statistic', 1.0169981039356226), ('p-value', 0.39192039307627247)]
We have now checked the assumptions of linear regression. The Shapiro-Wilk test rejects strict normality of the residuals, but the other assumptions are satisfied; treating the residuals as approximately normal, we can move towards the prediction part.
# predictions on the test set
pred = olsmod2.predict(x_test2)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 3343 | 5.059 | 5.123 |
| 2077 | 4.359 | 4.305 |
| 1996 | 3.988 | 4.132 |
| 313 | 4.194 | 4.079 |
| 427 | 3.481 | 3.409 |
| 1206 | 4.386 | 4.423 |
| 2285 | 4.788 | 4.787 |
| 281 | 5.343 | 5.240 |
| 1231 | 4.551 | 4.441 |
| 527 | 3.863 | 3.910 |
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train2, y_train)
olsmod2_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.091 | 0.074 | 0.985 | 0.985 | 1.859 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test2, y_test)
olsmod2_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 0.097 | 0.078 | 0.984 | 0.984 | 1.921 |
Let's compare the initial model created with sklearn and the final statsmodels model.
# training performance comparison
models_train_comp_df = pd.concat(
    [linearregression_train_perf.T, olsmod2_train_perf.T], axis=1,
)
models_train_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.091 | 0.091 |
| MAE | 0.074 | 0.074 |
| R-squared | 0.986 | 0.985 |
| Adj. R-squared | 0.985 | 0.985 |
| MAPE | 1.847 | 1.859 |
# test performance comparison
models_test_comp_df = pd.concat(
    [linearregression_test_perf.T, olsmod2_test_perf.T], axis=1,
)
models_test_comp_df.columns = [
    "Linear Regression sklearn",
    "Linear Regression statsmodels",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.097 | 0.097 |
| MAE | 0.078 | 0.078 |
| R-squared | 0.984 | 0.984 |
| Adj. R-squared | 0.983 | 0.984 |
| MAPE | 1.922 | 1.921 |
olsmodel_final = sm.OLS(y_train, x_train2).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_log R-squared: 0.985
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 1.312e+04
Date: Thu, 21 Oct 2021 Prob (F-statistic): 0.00
Time: 18:29:16 Log-Likelihood: 2093.7
No. Observations: 2143 AIC: -4163.
Df Residuals: 2131 BIC: -4095.
Df Model: 11
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const -8.3329 3.450 -2.415 0.016 -15.099 -1.567
screen_size 0.0017 0.001 2.270 0.023 0.000 0.003
selfie_camera_mp 0.0014 0.001 2.083 0.037 8.3e-05 0.003
release_year 0.0041 0.002 2.372 0.018 0.001 0.007
days_used -0.0011 1.2e-05 -88.620 0.000 -0.001 -0.001
new_price_log 0.9929 0.004 265.860 0.000 0.986 1.000
brand_name_Celkon 0.0410 0.019 2.173 0.030 0.004 0.078
brand_name_Gionee -0.0308 0.015 -2.026 0.043 -0.061 -0.001
brand_name_Realme 0.0476 0.021 2.321 0.020 0.007 0.088
brand_name_Sony 0.0282 0.013 2.147 0.032 0.002 0.054
5g_yes -0.0488 0.017 -2.811 0.005 -0.083 -0.015
bintotalmem_gold -0.1655 0.042 -3.971 0.000 -0.247 -0.084
==============================================================================
Omnibus: 287.881 Durbin-Watson: 2.044
Prob(Omnibus): 0.000 Jarque-Bera (JB): 717.980
Skew: -0.751 Prob(JB): 1.24e-156
Kurtosis: 5.406 Cond. No. 3.74e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.74e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
# Read in dataset
df = pd.read_csv("used_phone_data.csv")
sns.histplot(data=df, x="used_price", kde=True)
plt.title("Distribution of used phone prices")
plt.show()
# note: distplot is deprecated in newer versions of seaborn; histplot is its replacement
sns.distplot(df["used_price"], kde=False, rug=True)
<AxesSubplot:xlabel='used_price'>
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
sns.set(rc={'figure.figsize':(11.7,8.27)})
labeled_barplot(df, "os", perc=True)
sns.distplot(df["ram"], color="hotpink")
plt.show()
sns.boxplot(df["ram"], orient="v")
plt.show()
df.groupby("brand_name")["ram"].nunique().plot(kind="bar")
plt.show()
osbycount = pd.pivot_table(
    df, index=["brand_name"], values="ram", aggfunc={len, np.mean, np.min, np.max}
)
print(osbycount)
              amax   amin      len   mean
brand_name
Acer         4.000  1.000   51.000  3.902
Alcatel      4.000  0.250  125.000  3.426
Apple        6.000  2.000   59.000  4.000
Asus         8.000  2.000  126.000  4.048
BlackBerry   4.000  0.250   22.000  3.830
Celkon       4.000  0.250   37.000  1.466
Coolpad      4.000  3.000   22.000  3.955
Gionee       4.000  0.250   56.000  3.933
Google       6.000  4.000   15.000  4.533
HTC          6.000  3.000  110.000  4.000
Honor        8.000  2.000  118.000  4.593
Huawei      16.000  0.030  264.000  4.641
Infinix      4.000  2.000   10.000  2.600
Karbonn      4.000  0.250   30.000  3.375
LG           8.000  0.250  212.000  3.894
Lava         4.000  0.250   36.000  3.278
Lenovo       6.000  0.250  172.000  3.887
Meizu        8.000  2.000   62.000  4.452
Micromax     4.000  0.250  120.000  3.750
Microsoft    4.000  4.000   22.000  4.000
Motorola    12.000  2.000  110.000  3.945
Nokia       16.000  0.250  121.000  3.601
OnePlus     12.000  4.000   22.000  6.364
Oppo        12.000  1.000  129.000  4.961
Others       8.000  0.030  509.000  3.751
Panasonic    4.000  4.000   47.000  4.000
Realme       6.000  2.000   41.000  4.195
Samsung     12.000  0.250  364.000  4.159
Sony         8.000  4.000   88.000  4.068
Spice        4.000  0.250   30.000  3.750
Vivo         8.000  0.500  117.000  4.756
XOLO         4.000  4.000   49.000  4.000
Xiaomi      12.000  1.000  134.000  4.567
ZTE          8.000  0.250  141.000  4.023
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# Distribution of RAM across all phones
histogram_boxplot(df, "ram")
# Relationship between brand and RAM
plt.figure(figsize=(15, 7))
sns.boxplot(x="brand_name", y="ram", data=df, palette="GnBu")
plt.xticks(rotation=60)
plt.show()
table1 = pd.pivot_table(df, index=["brand_name", "ram"], aggfunc={len, np.mean})
table1
| battery | days_used | int_memory | main_camera_mp | new_price | release_year | screen_size | selfie_camera_mp | used_price | weight | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| len | mean | len | mean | len | mean | len | mean | len | mean | len | mean | len | mean | len | mean | len | mean | len | mean | ||
| brand_name | ram | ||||||||||||||||||||
| Acer | 1.000 | 1.000 | 2700.000 | 1.000 | 578.000 | 1.000 | 16.000 | 1.000 | 5.000 | 1.000 | 192.050 | 1.000 | 2014.000 | 1.000 | 25.560 | 1.000 | 2.000 | 1.000 | 96.190 | 1.000 | 508.000 |
| 2.000 | 1.000 | 2700.000 | 1.000 | 918.000 | 1.000 | 32.000 | 1.000 | 5.000 | 1.000 | 230.050 | 1.000 | 2014.000 | 1.000 | 25.560 | 1.000 | 2.000 | 1.000 | 68.980 | 1.000 | 508.000 | |
| 4.000 | 49.000 | 3011.837 | 49.000 | 797.878 | 49.000 | 23.184 | 49.000 | 7.048 | 49.000 | 171.309 | 49.000 | 2014.388 | 49.000 | 14.824 | 49.000 | 2.494 | 49.000 | 63.032 | 49.000 | 214.357 | |
| Alcatel | 0.250 | 7.000 | 1360.000 | 7.000 | 811.571 | 7.000 | 278.857 | 7.000 | 1.821 | 7.000 | 59.837 | 7.000 | 2013.714 | 7.000 | 7.641 | 7.000 | 0.300 | 7.000 | 19.624 | 7.000 | 101.129 |
| 0.500 | 3.000 | 3103.333 | 3.000 | 725.333 | 3.000 | 12.000 | 3.000 | 0.867 | 3.000 | 77.013 | 3.000 | 2015.667 | 3.000 | 14.607 | 3.000 | 1.867 | 3.000 | 31.447 | 3.000 | 233.333 | |
| 1.000 | 10.000 | 2820.000 | 10.000 | 826.900 | 10.000 | 7.600 | 10.000 | 2.760 | 10.000 | 129.681 | 10.000 | 2014.600 | 10.000 | 17.859 | 10.000 | 1.110 | 10.000 | 45.906 | 10.000 | 313.130 | |
| 2.000 | 2.000 | 4000.000 | 2.000 | 162.500 | 2.000 | 24.000 | 2.000 | 13.000 | 2.000 | 87.078 | 2.000 | 2020.000 | 2.000 | 16.110 | 2.000 | 5.000 | 2.000 | 63.275 | 2.000 | 164.000 | |
| 3.000 | 1.000 | 4000.000 | 1.000 | 201.000 | 1.000 | 32.000 | 1.000 | 13.000 | 1.000 | 92.650 | 1.000 | 2020.000 | 1.000 | 18.730 | 1.000 | 5.000 | 1.000 | 69.810 | 1.000 | 165.000 | |
| 4.000 | 102.000 | 2354.314 | 102.000 | 759.088 | 102.000 | 23.373 | 102.000 | 6.948 | 102.000 | 152.930 | 102.000 | 2014.931 | 102.000 | 12.884 | 102.000 | 2.439 | 102.000 | 59.104 | 102.000 | 154.667 | |
| Apple | 2.000 | 1.000 | 5124.000 | 1.000 | 827.000 | 1.000 | 16.000 | 1.000 | 8.000 | 1.000 | 362.060 | 1.000 | 2015.000 | 1.000 | 19.210 | 1.000 | 1.200 | 1.000 | 108.660 | 1.000 | 299.000 |
| 3.000 | 2.000 | 5324.000 | 2.000 | 274.000 | 2.000 | 48.000 | 2.000 | 10.000 | 2.000 | 352.005 | 2.000 | 2019.500 | 2.000 | 18.495 | 2.000 | 4.100 | 2.000 | 245.545 | 2.000 | 315.500 | |
| 4.000 | 54.000 | 3262.611 | 54.000 | 677.056 | 54.000 | 74.370 | 54.000 | 10.056 | 54.000 | 645.103 | 54.000 | 2016.352 | 54.000 | 15.654 | 54.000 | 4.915 | 54.000 | 284.261 | 54.000 | 208.287 | |
| 6.000 | 2.000 | 7745.250 | 2.000 | 234.000 | 2.000 | 128.000 | 2.000 | 12.000 | 2.000 | 899.835 | 2.000 | 2020.000 | 2.000 | 29.925 | 2.000 | 7.000 | 2.000 | 675.320 | 2.000 | 556.000 | |
| Asus | 2.000 | 2.000 | 3265.000 | 2.000 | 956.500 | 2.000 | 16.000 | 2.000 | 13.000 | 2.000 | 225.850 | 2.000 | 2014.000 | 2.000 | 15.240 | 2.000 | 2.000 | 2.000 | 67.795 | 2.000 | 196.000 |
| 4.000 | 121.000 | 3362.488 | 121.000 | 752.554 | 121.000 | 49.851 | 121.000 | 9.934 | 121.000 | 238.324 | 121.000 | 2015.314 | 121.000 | 15.435 | 121.000 | 4.357 | 121.000 | 94.476 | 121.000 | 215.784 | |
| 6.000 | 1.000 | 5000.000 | 1.000 | 232.000 | 1.000 | 64.000 | 1.000 | NaN | 1.000 | 654.500 | 1.000 | 2019.000 | 1.000 | 15.880 | 1.000 | 8.000 | 1.000 | 490.960 | 1.000 | 190.000 | |
| 8.000 | 2.000 | 5000.000 | 2.000 | 433.000 | 2.000 | 128.000 | 2.000 | 13.000 | 2.000 | 841.232 | 2.000 | 2018.500 | 2.000 | 19.925 | 2.000 | 16.000 | 2.000 | 508.145 | 2.000 | 220.000 | |
| BlackBerry | 0.250 | 1.000 | 1450.000 | 1.000 | 774.000 | 1.000 | 512.000 | 1.000 | 5.000 | 1.000 | 130.240 | 1.000 | 2013.000 | 1.000 | 6.350 | 1.000 | 2.000 | 1.000 | 52.170 | 1.000 | 120.000 |
| 4.000 | 21.000 | 2887.857 | 21.000 | 689.048 | 21.000 | 35.810 | 21.000 | 10.647 | 21.000 | 239.752 | 21.000 | 2015.286 | 21.000 | 12.700 | 21.000 | 5.290 | 21.000 | 104.414 | 21.000 | 162.500 | |
| Celkon | 0.250 | 25.000 | 1464.000 | 25.000 | 795.360 | 25.000 | 296.960 | 25.000 | 2.358 | 25.000 | 45.876 | 25.000 | 2013.040 | 25.000 | 8.730 | 25.000 | 0.340 | 25.000 | 16.739 | 25.000 | 137.332 |
| 4.000 | 12.000 | 1450.000 | 12.000 | 814.000 | 12.000 | 58.667 | 12.000 | 6.179 | 12.000 | 92.375 | 12.000 | 2013.667 | 12.000 | 11.270 | 12.000 | 1.250 | 12.000 | 34.960 | 12.000 | 130.767 | |
| Coolpad | 3.000 | 1.000 | 4000.000 | 1.000 | 342.000 | 1.000 | 32.000 | 1.000 | NaN | 1.000 | 161.610 | 1.000 | 2019.000 | 1.000 | 20.960 | 1.000 | 13.000 | 1.000 | 105.360 | 1.000 | 170.000 |
| 4.000 | 21.000 | 2894.762 | 21.000 | 786.429 | 21.000 | 36.571 | 21.000 | 11.263 | 21.000 | 195.841 | 21.000 | 2016.238 | 21.000 | 13.710 | 21.000 | 6.905 | 21.000 | 76.622 | 21.000 | 153.638 | |
| Gionee | 0.250 | 1.000 | 1500.000 | 1.000 | 602.000 | 1.000 | 512.000 | 1.000 | 2.000 | 1.000 | 60.290 | 1.000 | 2013.000 | 1.000 | 8.410 | 1.000 | 0.300 | 1.000 | 24.090 | 1.000 | 142.900 |
| 4.000 | 55.000 | 3147.364 | 55.000 | 785.927 | 55.000 | 35.200 | 55.000 | 10.057 | 55.000 | 258.926 | 55.000 | 2014.964 | 55.000 | 13.715 | 55.000 | 5.249 | 55.000 | 93.115 | 55.000 | 161.815 | |
| 4.000 | 11.000 | 3765.909 | 11.000 | 564.091 | 11.000 | 55.273 | 11.000 | 11.836 | 11.000 | 454.468 | 11.000 | 2017.545 | 11.000 | 15.167 | 11.000 | 7.333 | 11.000 | 218.852 | 11.000 | 191.455 | |
| 6.000 | 4.000 | 3250.000 | 4.000 | 395.750 | 4.000 | 64.000 | 4.000 | 12.200 | 4.000 | 530.400 | 4.000 | 2019.000 | 4.000 | 14.765 | 4.000 | 8.000 | 4.000 | 295.200 | 4.000 | 177.500 | |
| HTC | 3.000 | 4.000 | 3575.000 | 4.000 | 300.250 | 4.000 | 32.000 | 4.000 | 13.000 | 4.000 | 164.570 | 4.000 | 2019.000 | 4.000 | 15.560 | 4.000 | 12.000 | 4.000 | 111.405 | 4.000 | 165.000 |
| 4.000 | 104.000 | 2552.212 | 104.000 | 782.288 | 104.000 | 31.231 | 104.000 | 10.665 | 104.000 | 249.890 | 104.000 | 2014.952 | 104.000 | 12.823 | 104.000 | 5.332 | 104.000 | 96.058 | 104.000 | 154.396 | |
| 6.000 | 2.000 | 3930.000 | 2.000 | 345.500 | 2.000 | 128.000 | 2.000 | 13.000 | 2.000 | 418.545 | 2.000 | 2019.000 | 2.000 | 15.240 | 2.000 | 14.500 | 2.000 | 261.620 | 2.000 | 180.000 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Samsung | 4.000 | 320.000 | 3104.991 | 320.000 | 747.916 | 320.000 | 37.850 | 320.000 | 9.096 | 320.000 | 273.897 | 320.000 | 2015.250 | 320.000 | 14.263 | 320.000 | 4.823 | 320.000 | 110.714 | 320.000 | 194.346 |
| 6.000 | 10.000 | 5158.000 | 10.000 | 269.400 | 10.000 | 121.600 | 10.000 | 9.600 | 10.000 | 472.473 | 10.000 | 2019.600 | 10.000 | 18.239 | 10.000 | 27.200 | 10.000 | 319.292 | 10.000 | 236.000 | |
| 8.000 | 6.000 | 3966.667 | 6.000 | 279.333 | 6.000 | 170.667 | 6.000 | 8.667 | 6.000 | 680.265 | 6.000 | 2019.833 | 6.000 | 15.982 | 6.000 | 15.167 | 6.000 | 463.262 | 6.000 | 175.517 | |
| 12.000 | 7.000 | 4362.143 | 7.000 | 257.143 | 7.000 | 219.429 | 7.000 | 9.714 | 7.000 | 948.278 | 7.000 | 2019.571 | 7.000 | 16.534 | 7.000 | 11.571 | 7.000 | 631.533 | 7.000 | 202.857 | |
| Sony | 4.000 | 86.000 | 2927.849 | 86.000 | 740.209 | 86.000 | 30.512 | 86.000 | 14.704 | 86.000 | 271.932 | 86.000 | 2015.244 | 86.000 | 13.894 | 86.000 | 5.322 | 86.000 | 111.593 | 86.000 | 178.371 |
| 6.000 | 1.000 | 3140.000 | 1.000 | 348.000 | 1.000 | 128.000 | 1.000 | NaN | 1.000 | 392.692 | 1.000 | 2019.000 | 1.000 | 15.400 | 1.000 | 8.000 | 1.000 | 254.390 | 1.000 | 164.000 | |
| 8.000 | 1.000 | 4000.000 | 1.000 | 236.000 | 1.000 | 256.000 | 1.000 | NaN | 1.000 | 934.150 | 1.000 | 2020.000 | 1.000 | 16.030 | 1.000 | 8.000 | 1.000 | 698.610 | 1.000 | 181.400 | |
| Spice | 0.250 | 2.000 | 2100.000 | 2.000 | 909.500 | 2.000 | 512.000 | 2.000 | 2.225 | 2.000 | 45.660 | 2.000 | 2013.000 | 2.000 | 9.285 | 2.000 | 0.800 | 2.000 | 15.770 | 2.000 | 158.000 |
| 4.000 | 28.000 | 2197.500 | 28.000 | 879.214 | 28.000 | 16.571 | 28.000 | 5.114 | 28.000 | 101.739 | 28.000 | 2013.250 | 28.000 | 11.922 | 28.000 | 1.629 | 28.000 | 33.480 | 28.000 | 161.354 | |
| Vivo | 0.500 | 1.000 | 1900.000 | 1.000 | 931.000 | 1.000 | 4.000 | 1.000 | 5.000 | 1.000 | 109.930 | 1.000 | 2013.000 | 1.000 | 10.950 | 1.000 | 2.000 | 1.000 | 33.010 | 1.000 | 130.000 |
| 2.000 | 2.000 | 4030.000 | 2.000 | 361.000 | 2.000 | 16.000 | 2.000 | 10.500 | 2.000 | 94.505 | 2.000 | 2019.000 | 2.000 | 18.730 | 2.000 | 5.000 | 2.000 | 57.985 | 2.000 | 163.500 | |
| 3.000 | 2.000 | 5000.000 | 2.000 | 275.500 | 2.000 | 48.000 | 2.000 | 13.000 | 2.000 | 172.090 | 2.000 | 2019.000 | 2.000 | 20.800 | 2.000 | 8.000 | 2.000 | 119.335 | 2.000 | 191.000 | |
| 4.000 | 83.000 | 3441.867 | 83.000 | 579.373 | 83.000 | 73.446 | 83.000 | 12.981 | 83.000 | 314.708 | 83.000 | 2017.349 | 83.000 | 17.633 | 83.000 | 12.606 | 83.000 | 144.570 | 83.000 | 165.249 | |
| 6.000 | 9.000 | 4393.333 | 9.000 | 306.333 | 9.000 | 113.778 | 9.000 | 13.000 | 9.000 | 305.032 | 9.000 | 2019.444 | 9.000 | 22.664 | 9.000 | 17.333 | 9.000 | 198.310 | 9.000 | 195.456 | |
| 8.000 | 20.000 | 4496.500 | 20.000 | 276.050 | 20.000 | 140.800 | 20.000 | 13.000 | 20.000 | 398.112 | 20.000 | 2019.400 | 20.000 | 23.461 | 20.000 | 22.800 | 20.000 | 275.330 | 20.000 | 195.860 | |
| XOLO | 4.000 | 49.000 | 2220.000 | 49.000 | 823.245 | 49.000 | 22.204 | 49.000 | 7.455 | 49.000 | 130.899 | 49.000 | 2013.653 | 49.000 | 12.482 | 49.000 | 1.718 | 49.000 | 47.340 | 49.000 | 151.019 |
| Xiaomi | 1.000 | 1.000 | 570.000 | 1.000 | 267.000 | 1.000 | 8.000 | 1.000 | 12.000 | 1.000 | 148.670 | 1.000 | 2019.000 | 1.000 | 14.920 | 1.000 | 20.000 | 1.000 | 115.800 | 1.000 | 44.000 |
| 2.000 | 4.000 | 4750.000 | 4.000 | 325.500 | 4.000 | 28.000 | 4.000 | 12.500 | 4.000 | 82.444 | 4.000 | 2019.500 | 4.000 | 19.008 | 4.000 | 7.250 | 4.000 | 54.957 | 4.000 | 182.250 | |
| 3.000 | 7.000 | 4434.286 | 7.000 | 381.429 | 7.000 | 36.571 | 7.000 | 12.000 | 7.000 | 121.000 | 7.000 | 2019.143 | 7.000 | 20.230 | 7.000 | 10.429 | 7.000 | 74.134 | 7.000 | 191.571 | |
| 4.000 | 89.000 | 3711.798 | 89.000 | 658.146 | 89.000 | 71.730 | 89.000 | 12.489 | 89.000 | 221.077 | 89.000 | 2016.764 | 89.000 | 17.239 | 89.000 | 9.179 | 89.000 | 99.985 | 89.000 | 176.378 | |
| 6.000 | 21.000 | 4373.333 | 21.000 | 269.286 | 21.000 | 82.286 | 21.000 | 12.000 | 21.000 | 259.173 | 21.000 | 2019.571 | 21.000 | 23.428 | 21.000 | 20.381 | 21.000 | 176.118 | 21.000 | 202.476 | |
| 8.000 | 11.000 | 4498.182 | 11.000 | 314.909 | 11.000 | 162.909 | 11.000 | 12.000 | 11.000 | 503.599 | 11.000 | 2019.545 | 11.000 | 23.106 | 11.000 | 20.364 | 11.000 | 338.146 | 11.000 | 210.273 | |
| 12.000 | 1.000 | 4050.000 | 1.000 | 478.000 | 1.000 | 512.000 | 1.000 | 12.000 | 1.000 | 2498.240 | 1.000 | 2019.000 | 1.000 | 32.390 | 1.000 | 20.000 | 1.000 | 1248.990 | 1.000 | 241.000 | |
| ZTE | 0.250 | 1.000 | 1400.000 | 1.000 | 1006.000 | 1.000 | 512.000 | 1.000 | 3.150 | 1.000 | 60.110 | 1.000 | 2014.000 | 1.000 | 10.160 | 1.000 | 5.000 | 1.000 | 17.990 | 1.000 | 120.000 |
| 1.000 | 2.000 | 2000.000 | 2.000 | 367.000 | 2.000 | 16.000 | 2.000 | 8.000 | 2.000 | 61.060 | 2.000 | 2019.000 | 2.000 | 12.700 | 2.000 | 5.000 | 2.000 | 40.560 | 2.000 | 144.000 | |
| 2.000 | 3.000 | 3000.000 | 3.000 | 496.333 | 3.000 | 21.333 | 3.000 | 14.000 | 3.000 | 103.787 | 3.000 | 2018.667 | 3.000 | 18.360 | 3.000 | 7.000 | 3.000 | 51.903 | 3.000 | 146.000 | |
| 3.000 | 3.000 | 3466.667 | 3.000 | 352.667 | 3.000 | 42.667 | 3.000 | 15.000 | 3.000 | 113.017 | 3.000 | 2019.000 | 3.000 | 15.877 | 3.000 | 9.667 | 3.000 | 73.227 | 3.000 | 163.333 | |
| 4.000 | 124.000 | 2776.411 | 124.000 | 772.202 | 124.000 | 37.419 | 124.000 | 11.068 | 124.000 | 205.505 | 124.000 | 2015.500 | 124.000 | 13.213 | 124.000 | 5.819 | 124.000 | 79.592 | 124.000 | 154.192 | |
| 6.000 | 5.000 | 4420.000 | 5.000 | 332.800 | 5.000 | 115.200 | 5.000 | 25.667 | 5.000 | 468.668 | 5.000 | 2019.400 | 5.000 | 23.686 | 5.000 | 15.200 | 5.000 | 301.092 | 5.000 | 190.800 | |
| 8.000 | 3.000 | 4866.667 | 3.000 | 234.667 | 3.000 | 170.667 | 3.000 | 48.000 | 3.000 | 636.793 | 3.000 | 2019.667 | 3.000 | 25.560 | 3.000 | 12.000 | 3.000 | 450.427 | 3.000 | 216.000 | |
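The pivot above reports a count and a mean for every brand/RAM pair, but the resulting column labels are hard to read. Pandas named aggregation produces the same statistics with explicit names; a minimal sketch on a hypothetical mini-frame that reuses the notebook's column names:

```python
import pandas as pd

# Hypothetical rows standing in for df
df = pd.DataFrame(
    {
        "brand_name": ["Acer", "Acer", "Apple", "Apple"],
        "ram": [4.0, 4.0, 4.0, 6.0],
        "used_price": [96.19, 63.03, 284.26, 675.32],
    }
)

# Name each statistic explicitly instead of relying on aggfunc function names
table = df.groupby(["brand_name", "ram"]).agg(
    count=("used_price", "size"),
    mean_used_price=("used_price", "mean"),
)
print(table)
```

This yields one clearly labeled column per statistic, which is easier to reference downstream than the nested `(len, column)` / `(mean, column)` headers in the pivot output.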
sns.lmplot(y="battery", x="weight", data=df)
# Query the DataFrame for phones with battery capacity greater than 4500 mAh
df_g = df.query("battery > 4500")
df_g.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 346.000 | 21.487 | 5.170 | 10.160 | 16.350 | 20.960 | 25.560 | 46.360 |
| main_camera_mp | 292.000 | 9.186 | 4.900 | 0.300 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 346.000 | 7.206 | 6.254 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 346.000 | 59.145 | 86.464 | 16.000 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 346.000 | 4.246 | 1.467 | 1.000 | 4.000 | 4.000 | 4.000 | 12.000 |
| battery | 346.000 | 5884.103 | 1327.656 | 4520.000 | 5000.000 | 5100.000 | 6690.000 | 12000.000 |
| weight | 346.000 | 330.719 | 160.953 | 23.000 | 198.000 | 299.000 | 467.000 | 950.000 |
| release_year | 346.000 | 2017.113 | 2.456 | 2013.000 | 2015.000 | 2017.500 | 2019.000 | 2020.000 |
| days_used | 346.000 | 568.760 | 278.010 | 92.000 | 319.250 | 581.500 | 778.750 | 1089.000 |
| new_price | 346.000 | 307.462 | 200.482 | 80.820 | 180.143 | 250.265 | 351.098 | 1200.850 |
| used_price | 346.000 | 155.901 | 134.678 | 33.090 | 79.958 | 111.460 | 167.787 | 901.270 |
histogram_boxplot(df_g, "weight")
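Beyond the overall summary, the large-battery subset can be compared across brands with a grouped mean. A minimal sketch, assuming a small hypothetical `df_g` with the same columns:

```python
import pandas as pd

# Hypothetical subset mirroring df_g (phones with battery > 4500 mAh)
df_g = pd.DataFrame(
    {
        "brand_name": ["Samsung", "Samsung", "Xiaomi"],
        "weight": [210.0, 230.0, 190.0],
    }
)

# Mean weight per brand among large-battery phones
mean_weight = df_g.groupby("brand_name")["weight"].mean()
print(mean_weight)
```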
# taking input
num = float(input("Enter the distance measured in centimeter : "))
# converting from centimeters to inches (1 inch = 2.54 centimeters)
inc = num / 2.54
# printing the result
print("Distance in inch : ", inc)
Enter the distance measured in centimeter : 15.24 Distance in inch : 6.0
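The one-off conversion above can be wrapped in a small helper so the 2.54 cm-per-inch factor lives in one place; a minimal sketch (the name `cm_to_inches` is an assumption, not part of the notebook):

```python
CM_PER_INCH = 2.54  # exact definition of the inch

def cm_to_inches(cm: float) -> float:
    """Convert a length from centimeters to inches."""
    return cm / CM_PER_INCH

# Same conversion as the interactive cell above
print("Distance in inches:", cm_to_inches(15.24))
```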
# Query the DataFrame for phones with screen sizes greater than 6 inches (15.24 cm)
df_ss = df.query("screen_size > 15.24")
df_ss.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 1235.000 | 20.384 | 4.220 | 15.400 | 16.190 | 19.840 | 23.970 | 46.360 |
| main_camera_mp | 1062.000 | 9.430 | 4.812 | 0.300 | 5.000 | 10.500 | 13.000 | 48.000 |
| selfie_camera_mp | 1234.000 | 10.812 | 8.969 | 0.300 | 2.200 | 8.000 | 16.000 | 32.000 |
| int_memory | 1235.000 | 80.094 | 90.459 | 0.200 | 32.000 | 64.000 | 128.000 | 1024.000 |
| ram | 1235.000 | 4.518 | 1.833 | 0.250 | 4.000 | 4.000 | 4.000 | 12.000 |
| battery | 1233.000 | 4245.503 | 1306.243 | 1200.000 | 3500.000 | 4000.000 | 4500.000 | 12000.000 |
| weight | 1235.000 | 244.604 | 123.042 | 23.000 | 171.000 | 192.000 | 291.000 | 950.000 |
| release_year | 1235.000 | 2017.672 | 2.324 | 2013.000 | 2016.000 | 2019.000 | 2019.000 | 2020.000 |
| days_used | 1235.000 | 493.794 | 250.634 | 91.000 | 285.500 | 452.000 | 667.500 | 1090.000 |
| new_price | 1235.000 | 310.519 | 256.334 | 40.080 | 150.825 | 241.810 | 381.504 | 2560.200 |
| used_price | 1235.000 | 173.820 | 169.874 | 15.190 | 74.980 | 122.830 | 200.700 | 1916.540 |
histogram_boxplot(df_ss, "screen_size")
# Query the DataFrame for phones with selfie cameras greater than 8 MP
df_sc = df.query("selfie_camera_mp > 8")
df_sc.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 666.000 | 18.334 | 4.593 | 2.700 | 15.560 | 16.110 | 22.030 | 32.390 |
| main_camera_mp | 545.000 | 12.138 | 4.410 | 0.300 | 10.500 | 13.000 | 13.000 | 48.000 |
| selfie_camera_mp | 666.000 | 18.680 | 6.312 | 9.000 | 16.000 | 16.000 | 20.000 | 32.000 |
| int_memory | 666.000 | 108.240 | 83.033 | 4.000 | 64.000 | 128.000 | 128.000 | 1024.000 |
| ram | 666.000 | 5.266 | 2.089 | 0.030 | 4.000 | 4.000 | 6.000 | 16.000 |
| battery | 666.000 | 3828.559 | 776.013 | 230.000 | 3400.000 | 4000.000 | 4200.000 | 6000.000 |
| weight | 666.000 | 180.390 | 28.711 | 25.000 | 165.550 | 181.000 | 196.375 | 300.000 |
| release_year | 666.000 | 2018.631 | 1.234 | 2013.000 | 2018.000 | 2019.000 | 2019.000 | 2020.000 |
| days_used | 666.000 | 424.200 | 206.416 | 91.000 | 264.000 | 385.500 | 545.250 | 1091.000 |
| new_price | 666.000 | 387.318 | 291.621 | 99.700 | 218.923 | 299.340 | 459.870 | 2560.200 |
| used_price | 666.000 | 228.294 | 195.126 | 35.740 | 118.400 | 165.020 | 268.502 | 1916.540 |
histogram_boxplot(df_sc, "selfie_camera_mp")
# Calculate the pairwise correlation between numeric columns
matrix = df.corr(numeric_only=True)
# Create a mask
mask = np.triu(np.ones_like(matrix, dtype=bool))
# Create a custom diverging palette
cmap = sns.diverging_palette(250, 15, s=75, l=40, n=9, center="light", as_cmap=True)
plt.figure(figsize=(16, 12))
sns.heatmap(matrix, mask=mask, center=0, annot=True, fmt=".2f", square=True, cmap=cmap)
plt.show()
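After inspecting the heatmap, it is often useful to rank candidate predictors by their absolute correlation with the target `used_price`. A minimal sketch on synthetic data standing in for `df` (the column set here is an illustrative subset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: used_price built to depend strongly on new_price
rng = np.random.default_rng(0)
new_price = rng.uniform(100, 1000, 50)
df = pd.DataFrame(
    {
        "new_price": new_price,
        "used_price": new_price * 0.5 + rng.normal(0, 20, 50),
        "days_used": rng.uniform(90, 1090, 50),
    }
)

# Rank predictors by absolute correlation with the target
corr_with_target = (
    df.corr(numeric_only=True)["used_price"]
    .drop("used_price")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

In this sketch `new_price` dominates by construction; on the real `df`, this ranking is a quick sanity check before fitting the linear regression model.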